SVDiff: Compact Parameter Space for Diffusion Fine-Tuning
Diffusion models have achieved remarkable success in text-to-image
generation, enabling the creation of high-quality images from text prompts or
other modalities. However, existing methods for customizing these models
struggle to handle multiple personalized subjects and are prone to
overfitting. Moreover, their large number of fine-tuned parameters makes
model storage inefficient. In
this paper, we propose a novel approach to address these limitations in
existing text-to-image diffusion models for personalization. Our method
involves fine-tuning the singular values of the weight matrices, leading to a
compact and efficient parameter space that reduces the risk of overfitting and
language drift. We also propose a Cut-Mix-Unmix data-augmentation technique
to enhance the quality of multi-subject image generation and a simple
text-based image editing framework. Our proposed SVDiff method has a
significantly smaller model size (1.7MB for StableDiffusion) compared to
existing methods (vanilla DreamBooth 3.66GB, Custom Diffusion 73MB), making it
more practical for real-world applications.Comment: Revised appendix with the addition of cross-attention regularization
for single-subject generatio
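As a concrete illustration of the core idea, here is a minimal PyTorch sketch
of spectral-shift fine-tuning: each weight matrix is decomposed once via SVD,
the singular vectors are frozen, and only a small additive shift on the
singular values is trained. The class and attribute names (SpectralShiftLinear,
shift) are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of SVD-based spectral-shift fine-tuning (names assumed).
import torch
import torch.nn as nn

class SpectralShiftLinear(nn.Module):
    def __init__(self, weight: torch.Tensor):
        super().__init__()
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        # Frozen singular vectors; only the 1-D spectral shift is trainable.
        self.register_buffer("U", U)
        self.register_buffer("S", S)
        self.register_buffer("Vh", Vh)
        self.shift = nn.Parameter(torch.zeros_like(S))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU keeps the shifted spectrum non-negative.
        W = self.U @ torch.diag(torch.relu(self.S + self.shift)) @ self.Vh
        return x @ W.T

layer = SpectralShiftLinear(torch.randn(320, 768))
print(sum(p.numel() for p in layer.parameters()))  # 320 trainable values
```

For a 320x768 matrix this leaves only 320 trainable values per layer, which is
why checkpoints in this parameter space are megabytes rather than gigabytes.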
Diffusion Guided Domain Adaptation of Image Generators
Can a text-to-image diffusion model be used as a training objective for
adapting a GAN generator to another domain? In this paper, we show that the
classifier-free guidance can be leveraged as a critic and enable generators to
distill knowledge from large-scale text-to-image diffusion models. Generators
can be efficiently shifted into new domains indicated by text prompts without
access to ground-truth samples from target domains. We demonstrate the
effectiveness and controllability of our method through extensive experiments.
Although not trained to minimize CLIP loss, our model achieves equally high
CLIP scores and significantly lower FID than prior work on short prompts, and
outperforms the baseline qualitatively and quantitatively on long and
complicated prompts. To the best of our knowledge, the proposed method is the
first attempt at incorporating large-scale pre-trained diffusion models and
distillation sampling for text-driven image-generator domain adaptation, and
it achieves quality previously out of reach. Moreover, we extend our work to
3D-aware style-based generators and DreamBooth guidance.
Comment: Project website: https://styleganfusion.github.io
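The mechanism at work is score distillation: the frozen diffusion model's
classifier-free-guided noise prediction acts as a critic, and its disagreement
with the injected noise is back-propagated into the GAN generator only. The
sketch below assumes a diffusers-style unet and scheduler; the function name
sds_loss, the timestep range, and the guidance scale are illustrative, not the
paper's exact training loop.

```python
# Sketch of score-distillation guidance for generator adaptation (assumed
# diffusers-style `unet` and `scheduler`; timestep weighting omitted).
import torch

def sds_loss(generator, unet, scheduler, text_emb, uncond_emb,
             z, guidance_scale=7.5):
    x = generator(z)                                   # fake image in [-1, 1]
    t = torch.randint(20, 980, (x.shape[0],), device=x.device)
    noise = torch.randn_like(x)
    x_t = scheduler.add_noise(x, noise, t)

    with torch.no_grad():
        # Classifier-free guidance: combine conditional and unconditional
        # noise predictions from the frozen diffusion model.
        eps_cond = unet(x_t, t, encoder_hidden_states=text_emb).sample
        eps_uncond = unet(x_t, t, encoder_hidden_states=uncond_emb).sample
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)

    # SDS treats (eps - noise) as the gradient w.r.t. x, so it is detached
    # and gradients flow only through the generator.
    grad = (eps - noise).detach()
    return (grad * x).sum() / x.shape[0]
```

Because the critic term is detached, the diffusion model stays frozen
throughout adaptation and only the generator's parameters are updated.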
Robust Conditional GAN from Uncertainty-Aware Pairwise Comparisons
Conditional generative adversarial networks have shown exceptional generation
performance over the past few years. However, they require large numbers of
annotations. To address this problem, we propose a novel generative adversarial
network utilizing weak supervision in the form of pairwise comparisons (PC-GAN)
for image attribute editing. By combining Bayesian uncertainty estimation
with noise-tolerant adversarial training, PC-GAN estimates attribute ratings
efficiently and remains robust to annotation noise. Through
extensive experiments, we show both qualitatively and quantitatively that
PC-GAN performs comparably with fully-supervised methods and outperforms
unsupervised baselines.
Comment: Accepted for spotlight at AAAI-2
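To make the pairwise-comparison supervision concrete, here is a minimal
PyTorch sketch of an uncertainty-aware rating model trained on comparisons:
each image receives a Gaussian rating whose predicted variance flattens the
comparison probability for noisy pairs, in the spirit of a
Thurstone/Bradley-Terry model. The RatingNet architecture and loss form are
illustrative assumptions, not the exact PC-GAN formulation.

```python
# Sketch of uncertainty-aware learning from pairwise comparisons (assumed
# architecture; not the paper's exact model).
import torch
import torch.nn as nn

class RatingNet(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(dim, 256), nn.ReLU())
        self.mean = nn.Linear(256, 1)      # attribute rating
        self.log_var = nn.Linear(256, 1)   # per-image uncertainty

    def forward(self, x):
        h = self.backbone(x)
        return self.mean(h).squeeze(-1), self.log_var(h).squeeze(-1)

def pairwise_loss(net, xa, xb, label):
    """label: float tensor, 1.0 if A shows more of the attribute than B."""
    mu_a, lv_a = net(xa)
    mu_b, lv_b = net(xb)
    # Probability that A > B under Gaussian ratings; noisier pairs get a
    # flatter, more tolerant comparison probability.
    std = torch.sqrt(lv_a.exp() + lv_b.exp())
    p = torch.sigmoid((mu_a - mu_b) / std)
    return nn.functional.binary_cross_entropy(p, label)

net = RatingNet()
loss = pairwise_loss(net, torch.randn(4, 512), torch.randn(4, 512),
                     torch.ones(4))
```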
On the Importance of Calibration in Semi-supervised Learning
State-of-the-art (SOTA) semi-supervised learning (SSL) methods have been
highly successful in leveraging a mix of labeled and unlabeled data by
combining techniques of consistency regularization and pseudo-labeling. During
pseudo-labeling, the model's predictions on unlabeled data are used for
training and thus, model calibration is important in mitigating confirmation
bias. Yet, many SOTA methods are optimized for model performance, with little
focus directed to improve model calibration. In this work, we empirically
demonstrate that model calibration is strongly correlated with model
performance and propose to improve calibration via approximate Bayesian
techniques. We introduce a family of new SSL models that optimize for
calibration and demonstrate their effectiveness across standard vision
benchmarks of CIFAR-10, CIFAR-100 and ImageNet, giving up to 15.9% improvement
in test accuracy. Furthermore, we also demonstrate their effectiveness in
additional realistic and challenging problems, such as class-imbalanced
datasets and in photonics science.
Comment: 24 pages
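One simple approximate Bayesian route to better-calibrated pseudo-labels is
Monte Carlo dropout: averaging several stochastic forward passes before
thresholding. The sketch below illustrates that general idea and assumes the
model contains dropout layers; it is not the specific model family introduced
in the paper.

```python
# Sketch of calibration-aware pseudo-labeling via MC dropout (general idea,
# not the paper's specific models).
import torch

@torch.no_grad()
def pseudo_labels(model, x, n_samples=8, threshold=0.95):
    model.train()  # keep dropout active for Monte Carlo sampling
    probs = torch.stack(
        [torch.softmax(model(x), dim=-1) for _ in range(n_samples)]
    ).mean(0)  # Bayesian model average over dropout masks
    conf, labels = probs.max(dim=-1)
    mask = conf >= threshold  # train only on confident predictions
    return labels[mask], mask
```

Better calibration makes the confidence threshold a more reliable filter,
which is precisely where confirmation bias enters pseudo-labeling.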
DMCVR: Morphology-Guided Diffusion Model for 3D Cardiac Volume Reconstruction
Accurate 3D cardiac reconstruction from cine magnetic resonance imaging
(cMRI) is crucial for improved cardiovascular disease diagnosis and
understanding of the heart's motion. However, current cardiac MRI-based
reconstruction technology used in clinical settings is 2D with limited
through-plane resolution, resulting in low-quality reconstructed cardiac
volumes. To better reconstruct 3D cardiac volumes from sparse 2D image stacks,
we propose a morphology-guided diffusion model for 3D cardiac volume
reconstruction, DMCVR, that synthesizes high-resolution 2D images and
corresponding 3D reconstructed volumes. Our method outperforms previous
approaches by conditioning the generative model on cardiac morphology,
eliminating the time-consuming iterative optimization of the latent code, and
improving generation quality. The learned latent spaces encode the global
semantics, local cardiac morphology, and details of each 2D cMRI slice,
providing highly interpretable cues for reconstructing the 3D cardiac shape. Our
experiments show that DMCVR is highly effective in both 2D generation and 3D
reconstruction. With DMCVR, we can produce
high-resolution 3D cardiac MRI reconstructions, surpassing current techniques.
Our proposed framework has great potential for improving the accuracy of
cardiac disease diagnosis and treatment planning. Code can be accessed at
https://github.com/hexiaoxiao-cs/DMCVR.
Comment: Accepted in MICCAI 202
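A minimal sketch of the conditioning idea, using a toy denoiser: the noisy 2D
slice is denoised together with a fused global-semantic and cardiac-morphology
latent, so no per-sample latent optimization is needed at inference. All
module names, sizes, and the omitted timestep embedding are illustrative
assumptions, not the released DMCVR code.

```python
# Toy sketch of morphology-guided conditioning for slice denoising
# (illustrative modules only; timestep embedding omitted for brevity).
import torch
import torch.nn as nn

class MorphologyGuidedDenoiser(nn.Module):
    def __init__(self, latent_dim=128, channels=1):
        super().__init__()
        # Fuse global semantics with local cardiac morphology.
        self.fuse = nn.Linear(2 * latent_dim, latent_dim)
        self.to_map = nn.Linear(latent_dim, 64 * 64)  # broadcast spatially
        self.net = nn.Sequential(
            nn.Conv2d(channels + 1, 32, 3, padding=1), nn.SiLU(),
            nn.Conv2d(32, channels, 3, padding=1),
        )

    def forward(self, x_t, z_semantic, z_morphology):
        cond = self.fuse(torch.cat([z_semantic, z_morphology], dim=-1))
        cond_map = self.to_map(cond).view(-1, 1, 64, 64)
        # Predict noise for the 2D cMRI slice given its conditioning latents.
        return self.net(torch.cat([x_t, cond_map], dim=1))

model = MorphologyGuidedDenoiser()
eps = model(torch.randn(2, 1, 64, 64), torch.randn(2, 128),
            torch.randn(2, 128))
print(eps.shape)  # torch.Size([2, 1, 64, 64])
```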
Hierarchically Self-Supervised Transformer for Human Skeleton Representation Learning
Despite the success of fully-supervised human skeleton sequence modeling,
utilizing self-supervised pre-training for skeleton sequence representation
learning has been an active field because acquiring task-specific skeleton
annotations at large scales is difficult. Recent studies focus on learning
video-level temporal and discriminative information using contrastive learning,
but overlook the hierarchical spatial-temporal nature of human skeletons.
Different from such superficial supervision at the video level, we propose a
self-supervised hierarchical pre-training scheme incorporated into a
hierarchical Transformer-based skeleton sequence encoder (Hi-TRS), to
explicitly capture spatial, short-term, and long-term temporal dependencies at
frame, clip, and video levels, respectively. To evaluate the proposed
self-supervised pre-training scheme with Hi-TRS, we conduct extensive
experiments covering three skeleton-based downstream tasks including action
recognition, action detection, and motion prediction. Under both supervised and
semi-supervised evaluation protocols, our method achieves state-of-the-art
performance. Additionally, we demonstrate that the prior knowledge learned by
our model in the pre-training stage has strong transfer capability for
different downstream tasks.
Comment: Accepted to ECCV 202
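To make the frame/clip/video hierarchy concrete, here is a minimal PyTorch
sketch: separate Transformer encoders operate over joints within a frame,
frames within a clip, and clips within a video, with mean pooling between
levels. Dimensions, depths, and pooling choices are illustrative assumptions,
not the exact Hi-TRS architecture.

```python
# Sketch of a hierarchical frame/clip/video skeleton encoder (assumed sizes).
import torch
import torch.nn as nn

def encoder(dim, depth=2):
    layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
    return nn.TransformerEncoder(layer, depth)

class HierarchicalSkeletonEncoder(nn.Module):
    def __init__(self, dim=64, clip_len=8):
        super().__init__()
        self.clip_len = clip_len
        self.joint_embed = nn.Linear(3, dim)   # (x, y, z) per joint
        self.frame_enc = encoder(dim)          # spatial, over joints
        self.clip_enc = encoder(dim)           # short-term temporal
        self.video_enc = encoder(dim)          # long-term temporal

    def forward(self, skel):                   # (B, T, J, 3); T % clip_len == 0
        B, T, J, _ = skel.shape
        x = self.joint_embed(skel).reshape(B * T, J, -1)
        frames = self.frame_enc(x).mean(1).reshape(B, T, -1)
        clips = frames.reshape(B * (T // self.clip_len), self.clip_len, -1)
        clips = self.clip_enc(clips).mean(1).reshape(B, T // self.clip_len, -1)
        return self.video_enc(clips).mean(1)   # video-level embedding

enc = HierarchicalSkeletonEncoder()
print(enc(torch.randn(2, 32, 25, 3)).shape)  # torch.Size([2, 64])
```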